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Abstract. Using 55 years of daily average temperatures from a local weather 
station, I made a least-absolutc-deviations (LAD) regression model that ac- 
counts for three effects: seasonal variations, the 11-year solar cycle, and a 
linear trend. The model was formulated as a linear programming problem and 
solved using widely available optimization software. The solution indicates 
that temperatures have gone up by about 2°F over the 55 years covered by 
the data. It also correctly identifies the known phase of the solar cycle; i.e., 
the date of the last solar minimum. It turns out that the maximum slope of 
the solar cycle sinusoid in the regression model is about the same size as the 
slope produced by the linear trend. The fact that the solar cycle was correctly 
extracted by the model is a strong indicator that effects of this size, in partic- 
ular the slope of the linear trend, can be accurately determined from the 55 
years of data analyzed. 

The main purpose for doing this analysis is to demonstrate that it is easy 
to find and analyze archived temperature data for oneself. In particular, this 
problem makes a good class project for upper-level undergraduate courses in 
optimization or in statistics. 

It is worth noting that a similar least-squares model failed to characterize 
the solar cycle correctly and hence even though it too indicates that temper- 
atures have been rising locally, one can be less confident in this result. 

The paper ends with a section presenting similar results from a few thou- 
sand sites distributed world-wide, some results from a modification of the 
model that includes both temperature and humidity, as well as a number of 
suggestions for future work and/or ideas for enhancements that could be used 
as classroom projects. 

1. Introduction 

Most research on climate change aims to produce a high-fidelity model of climate 
that spans centuries |13j if not millennia . Since directly observed temperature 
data is not available over these time scales, such models are forced to resort to proxy 
climate indicators (see, e.g., [8]). In this paper, actual temperature readings from a 
single undisturbed location spanning a time horizon of 55 years are analyzed using 
a least-absolute deviations (LAD) regression model that robustly extracts a small 
linear trend from the much larger seasonal variations. An analogous least-squares 
regression model generates results that are less reliable than those obtained with 
the LAD model. 

The purpose of this paper is not to attempt to improve on any of the global 
warming estimates that exist in the literature. Rather, the main purpose of this 
paper is to show that it is fairly easy to find and analyze archived temperature data 
with the hope that many others will make a similar analysis for places that are of 
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Figure 1. McGuire Air Force Base 



interest to them. I also hope to inspire educators who teach courses in optimization 
and/or statistics to develop classroom projects based on the ideas presented here. 

Least absolute deviations regression |Tj belongs to a class of statistical techniques 
called robust statistics |10j . The sample median is the simplest and most widely 
used example of a robust statistic. Sample medians have played an important role 
in a wide range of scientific fields including astrophysics T , medicine [3] , and signal 
processing [5] to name a few. A secondary purpose of this paper is to demonstrate 
that, at least for the data considered here, least-absolute deviations regression pro- 
vides better results than the corresponding least-squares regression. 



2. The Data 

The National Oceanic and Atmospheric Administration (NOAA) collects and 
archives weather data from thousands of collection sites around the globe. The 
data format and instructions for downloading the data can be found on the NOAA 
website [15] as can a list of the roughly 9000 weather stations [16] . McGuire Air 
Force Base, located not far from Princeton NJ, is one of the archived weather 
stations. This particular weather station seemed good for a number of reasons: (i) 
it is about 50 miles from New York City and 30 miles from Philadelphia, (ii) it is 
in a rather undeveloped part of the state, (iii) it was established in 1937 and has 
been a major airbase since 1942, and finally, (iv) it is only about 25 miles from 
the Atlantic Ocean and therefore its climate should be moderated somewhat by its 
proximity to an ocean. 
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3. The Model 

Let T d denote the average temperature in degrees Fahrenheit on day d E D 
where D is the set of days from January 1, 1955, to August 13, 2010 (that's 20,309 
days). 

We wish to model the average temperature as a constant xq plus a linear trend 
x\d plus a sinusoidal function with a one-year period representing seasonal changes, 

x 2 cos(27rd/365.25) + x 3 sin(27rd/365.25), 

plus a sinusoidal function with a period of 10.7 years to represent the solar cycle, 

x 4 cos(27rd/(10.7 x 365.25)) + x 5 sin(27rd/(10. 7 x 365.25)). 

The parameters Xq,Xi, . . . ,x$ are unknown regression coefficients. Our aim is to 
find the values of these parameters that minimize the sum of the absolute deviations: 

min \xq + X\d 

X °'-' X5 deD +X2 cos (27rd/365.25) + x 3 sin(27T(i/365.25) 

(1) 

+x A cos(2tt(2/(10.7 x 365.25)) + x 5 sin(27rd/(10.7 x 365.25)) 
-T d \. 

We use the usual trick of introducing a new variable for each absolute value term 
and then adding a pair of constraints that say that this new variable dominates the 
expression that was inside the absolute values and its negative. The result is a linear 
programming (LP) problem. One can check that the solution to the LP formulation 
is identical to the solution to the original model whenever the original problem is 
"convex" . In particular, they are the same whenever one minimizes a nonnegative 
weighted sum of absolute values of linear expressions subject to linear equality and 
inequality constraints ([H], Chapter 12). 

The linear programming problem, expressed in the AMPL modeling language [BJ, 
is shown in Figure [2] AMPL models, with their associated user-supplied data sets, 
can be solved online using the Network Enabled Optimzation Server (NEOS) at 
Argonne National Labs [I]. 

A least-absolute-deviations (LAD) model was chosen instead of a least-squares 
model because LAD regression, like the median statistic, is insensitive to "outliers" 
in the data. The least-squares variant is discussed in Section [6] 

3.1. Confidence Intervals. Let Xq, . . . , x* 5 denote the optimal solution to ([!]) and 
let e d denote the corresponding deviations: 

e d = x* + x* x d + x* 2 cos(27nf/365.25) + x% sin(27rd/365.25) 

+xl cos(27rd/(10.7 x 365.25)) + x\ sin(27rd/(10.7 x 365.25)) - T d . 

In the well-known bootstrap method jS], we assume that these e d s form an empirical 
distribution associated with an unknown underlying error distribution. As such, 
we can sample from these errors and generate new data: 

T' d = x* Q + x\d + xl cos(27rd/365.25) + x* 3 sin(27r<f/365.25) 

+xl cos(27rd/(10.7 x 365.25)) + x* 5 sin(27rd/(10.7 x 365.25)) - e' d 

where the e' d are drawn independently and with replacement from the set {e d : d € 
D}. In this manner we can generate several alternate data sets that are statistically 
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set DATES ordered; 
param avg {DATES} ; 
param day {DATES}; 
param pi := 4*atanCl); 

var x {j in 0. .5}; 

var dev {DATES} >= 0, := lj 

minimize sumdev: sum {d in DATES} dev[d]; 

subject to def_pos_dev {d in DATES}: 

x[0] + x[l]*day[d] + x[2]*cos( 2*pi*day [d] /365 . 25) 
+ x[3]*sin( 2*pi*day[d]/365.25) 
+ x[4]*cos( 2*pi*day[d]/(10.7*365.25)) 
+ x[S]*sin( 2*pi*day[d]/(10.7*365.25)) 

- avg[d] 
<= dev [d] ; 

subject to def_neg_dev {d in DATES}: 
-dev[d] <= 

x[0] + x[l]*day[d] + x[2]*cos( 2*pi*day [d] /365 . 25) 
+ x[3]*sin( 2*pi*day[d]/365.25) 
+ x[4]*cos( 2*pi*day[d]/(10.7*365.25)) 
+ x[5]*sin( 2*pi*day[d]/(10. 7*365. 25)) 

- avg [d] ; 

data; 

set DATES := include "data/Dates. dat " ; 
param: avg := include "data/McGuireAFB.dat" ; 
let {d in DATES} day[d] := ord (d, DATES) ; 

solve ; 



Figure 2. The model expressed in the AMPL modeling language. 



similar to the original one and we can then recompute the parameters xq,. .. ,x§ 
using these replicated data sets to get multiple estimates for each parameter and 
thereby compute 2cr-confidcnce intervals for each parameter (or any other derived 
parameter). 

4. The Results 

The linear programming problem can be solved in only a few minutes on a 
modern laptop computer. The optimal values of the parameters together with 
their 2cr-confidence intervals are 

52.6 ±0.27°F 

3.63 ± 0.75 °F/century 
20.4 ±0.16°F 
-8.31 ±0.17°F 
-0.197 ±0.137 °F 
0.211 ±0.202 °F 

4.1. Linear Trend. From x n , we see that the nominal temperature at McGuirc 
AFB was 52.56 ± 0.27 °F (on January 1, 1955). 



Xq — 

Xi = 

X2 = - 

X 3 = 

£4 = 

x 5 = 
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We also see, from x x , that there is a positive trend of 0.000099± 0.000020 °F/day, 
which scales to 3.63 ± 0.75 °F per century. This result is consistent with results 
from global climate change models, which predict a per century warming rate of 
between 2.0 °C and 2.4 °C ([HIS])- 

4.2. Amplitude of the Sinusoidal Fluctuations. From the sine and cosine 
terms X2 and X3, we can compute the amplitude of annual seasonal changes in 
temperatures: 



In other words, on the hottest summer day we should expect the temperature to 
be 22.02 degrees warmer than the nominal value of 52.56 degrees; that is, 74.58 
degrees. Of course, this is a daily average — daytime highs can be expected to be 
higher and nighttime lows lower by about the same amount. 

Similarly, from the x^ and x§ sine and cosine terms, we can compute the ampli- 
tude of the temperature changes brought about by the solar-cycle: 



The effect of the solar cycle is real but relatively small. 

4.3. Phase of the Sinusoidal Fluctuations. Close inspection of the output 
shows that January 22 is nominally the coldest day in the winter and July 24 
is the hottest day of summer. The plus/minus 2cr-error on these estimates is less 
than half a day. It is perhaps worth noting that the coldest days in the winter of 
2011 turned out to be January 23 and 24. 

According to the LAD model, February 12, 2007, was the day of the last min- 
imum in the 10.7 year solar cycle. In this case, the plus/minus 2cr-error on this 
estimate is about 400 days. It is well-known that the solar cycle had its last min- 
imum in 2007 [171. The correct extraction of the phase (and, in a later section, 
the period) of the solar cycle, which is a small effect having an amplitude of only 
0.29 °F, is strong support for fidelity of the LAD model. 

4.4. Visualizing the Results. Figure [3] shows a plot of all 20,309 data points. 
Overlaid on these data points is the solution of the LAD regression model. Daily 
and seasonal fluctuations completely dominate other effects. It is impossible to 
"see" any linear warming trend or the solar cycle. 

Figure [4] has the seasonal and solar-cycle variations removed. Even this plot is 
noisy. Of course, there are many days in a year and some days are unseasonably 
warm while others are unseasonably cool. It is not uncommon for there to be an 
"unseasonably warm" day that is 20 or even sometimes 30 degrees above seasonally 
adjusted averages. 

Figure [5] is derived from Figure [4] by applying a 101 day rolling average to each 
data point. The rolling average reduced the extreme fluctuations to about 1/10-th 
their original amplitude thus making the linear warming trend quite apparent. In 
NJ we have local warming. 



The length of the solar cycle is only approximately 10.7 years [21]. We can modify 
the model to predict, in addition to the parameters already being estimated, the 





5. Estimating the Period of the Solar Cycle 
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Average Daily Temperatures at McGuire AFB 




1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 



Figure 3. Plot Showing Actual Data and Regression Curve. 
Blue: Average daily temperatures at McGuire AFB from 1955 to 
2010. Red: Output from least absolute deviation regression model. 



Average Daily Temperature Minus Seasonal Cycle and Solar Cycle at McGuire AFB 




1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 



Date 

Figure 4. As before but with sinusoidal seasonal variation re- 
moved and sinusoidal solar-cycle variation removed as well. 



period of this cycle. To do this, we introduce one new variable xq and change the 
solar-cycle sine and cosine terms to read: 

x 4 cos(2x 6 7rd/(10.7 x 365.25)) + x 5 sin(2x 6 7rd/(10.7 x 365.25)). 

If the unknown parameter x§ is fixed at 1, forcing the solar-cycle to have a period 
of exactly 10.7 years, then the problem reduces to the linear programming problem 
considered earlier. If, on the other hand, we allow xe to vary, then the problem is 
nonlinear and even nonconvex and therefore potentially harder to solve. However, 
nonlinear (local) optimization algorithms produce provably locally optimal solu- 
tions "near" to the initially provided values for the variables. Hence, for problems 
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101 Day Running Average 




1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 

Date 

Figure 5. Smoothed Seasonally Subtracted Plot. To smooth out 
high frequency fluctuations, we use 101 day rolling averages of the 
data. 



in which rough estimates of the optimal values are known, one can expect such 
algorithms to produce the desired solution. Such is the case with the problem at 
hand. The result is that the first six parameters remain virtually unchanged and 
xq = 0.992 which translates to a 10.78 year solar cycle, in close agreement with the 
nominal value of 10.7 years. One might argue that it is unfair to initialize xq so 
that the starting point is so close to the known correct solution. Tabic [T] shows the 
period of the solar cycle obtained using various initial guesses. 



6. Least Squares Solution (Mean instead of Median) 
Suppose we change the objective to a sum of squares of deviations: 



minimize sumdev: sum {d in DATES} dev[d]~2; 

The resulting model is a (nonlinear) least squares model. The problem is no 
longer representable as a linear programming problem, but its objective function 
is still convex and the problem is still easy to solve using nonlinear optimization 
software. Fixing x§ to one (i.e., fixing the solar cycle to 10.7 years), the problem 
becomes a linear least squares regression model that can also be solved using any 
number of statistical packages, such as R 20J. The solution, however, is sensitive 
to outliers, which may or may not be present. Here's the output associated with 
the nonlinear least squares formulation: 



8 



ROBERT J. VANDERBEI 



Initial period (in years) Locally optimal period (in years) 





LAD 


Least Squares 


G 


6.64 


14.65 


7 


10.78 


6.53 


8 


10.78 


6.53 


9 


5.67 


6.53 


10 


8.12 


8.75 


10.7 


10.78 


14.65 


11 


-14.61 


14.65 


12 


4.19 


14.65 


13 


-14.61 


5.74 


14 


10.78 


14.65 


15 


-10.78 


oo 



Table 1. The derived period, in years, of the solar cycle com- 
puted using different initial guesses for variable x§ in the nonlinear 
versions of both the LAD and least-squares optimization models. 
In the table values of Xq are specified by giving the period of the 
sinusoid measured in years (recall that x e = 1 corresponds to a 
10.7 year solar cycle). Note that, for the least absolute deviations 
model, four of the eleven cases considered produced an answer close 
to the known answer whereas none of them are very close using the 
least squares model. 



52.6 °F 

1.2 x 10~ 4 °F/day 
-20.3 °F 
-7.97 °F 
0.275 °F 
0.454 °F 
0.730 

In this case, the rate of local warming is 4.37 °F per century. This number lies 
near the upper limit of the confidence interval given before. As Table [T] shows, the 
model also produces a wrong answer for the period of the solar cycle for each of the 
eleven values I used to initialize xq including Xq = 1, which corresponds to a solar 
cycle of 10.7 years. It is well known that the period of the solar cycle has been 
about 10.7 years for the past few centuries. 

7. Humidity 

It is well-known that higher temperatures imply more evaporation of water into 
the atmosphere; that is, increased temperature implies increased humidity. It turns 
out that the NOAA data sets contain not only temperature data but also dew point 
data. Dew point is the temperature below which water in the atmosphere condenses 



x = 

Xl = 

X2 = 

x 3 = 

X4 = 

x 5 = 

x 6 = 
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out to form clouds/fog. It is a quantitative measure of the amount of water in the 
atmosphere — the more water vapor, the higher the temperature must be for the 
air to hold that water vapor. It is common in summer months for there to be fog 
in the morning. The temperature and the dew point match when it is foggy. One 
says that the relative humidity is 100% at such times. In the summer months at 
mid-latitudes, dew points are often up in the 60 to 70 degrees Fahrenheit range. In 
the winter, obviously, the dew point is much lower — winter air is significantly dryer 
than summer air. Anyway, while weathermen typically report humidity in terms of 
relative humidity, dew point is a more direct measure of the amount of water vapor 
in the atmosphere. 

Given that NOAA tabulates this measure in the same data sets as used above, 
it turns out to be almost trivial to modify the scripts to perform a dew point 
regression. What we discover is that, in 1955 at McGuire Airforce Base, the nominal 
dew point (given by Xq) was 41.4 °F and that on the dampest summer days it was 
expected to be 22.6 °F higher and on the driest winter days it was expected to 
be that much lower. Furthermore, dewpoint is going up at a rate of 5.4 °F per 
century — a rate even greater than the rate that temperatures are rising, which 
means that relative humidity is also increasing. 

In the US, the National Weather Service reports a number called heat index on 
hot summer days. This index combines temperature and humidity into a single 
number on a temperature scale. However, this measure is fairly subjective as it is 
based on relative humidity and aims to approximate how the temperature "feels" 
to a person. In Canada, a similar measure called humidex is used ( |141 118] ). An 
advantage of the humidex over heat index is that humidex is based on temperature 
and dew point rather than temperature and relative humidity. In fact, the formula 
for humidex H (in degrees Fahrenheit) is rather simple: 

H = T + e.lle 5417 - 753 ^^-*) _ io, 

where T is temperature in degrees Fahrenheit and D is dew point in degrees Kelvin 
(of course, it is easy to convert degrees Fahrenheit to degrees Kelvin). 

Again, the fact that NOAA tabulates dew point together with temperature data 
makes it easy to produce values for the humidex. What we discover is that, in 1955 
at McGuire Airforce Base, the nominal humidex (given by xq) was 52.9 °F and that 
on the hottest/dampest summer days it was expected to be 30.8 °F higher and on 
the coldest/driest winter days it was expected to be that much lower. Furthermore, 
humidex is going up at a rate of 6.5 °F per century — a rate even greater than the 
rate that temperatures are rising. 

8. Going Global 

For fun, I applied the model to thousands of local weather stations around the 
world to produce a global "local warming map" . Specifically, data files were down- 
loaded from every NOAA weather station for which collection commenced prior to 
Jan 1, 1955 and is currently in ongoing. There may be, and it turns out that there 
usually are, gaps in the data — sometimes just a day here or there is missing but 
often there are multi-year gaps. Hence, the download script stipulated that the 
location must have collected at least 3650 days of data (i.e., 10 years worth). The 
resulting map is shown in Figure [6j 
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Figure 6. A map of 2399 local warming results distributed 
around the globe. In units of degrees Fahrenheit per century, the 
mean of these values is 4.18, the median value is 4.53, and the stan- 
dard deviation is 2.94. Dividing by the square root of the sample 
size, a confidence interval for the mean is [4.12,4.24]. 



I should mention that no attempt was made to filter out bad data. Also, seasonal 
variations are not sinusoidal in the tropics and so the simple model presented earlier 
is not very good at tropical latitudes. 

One can clearly see from Figure [6] that warming is more pronounced in the 
northern hemisphere. Such an observation is consistence with the hypothesis that 
warming over the last century is largely anthropogenic. 

9. Final Remarks 

It is remarkable that the solar cycle can be seen and a warming trend can be 
extracted from just one weather station's 55-year dataset. 

In producing confidence intervals, one of our assumptions is wrong. In the boot- 
strap method, we assumed that the temperature fluctuations from day to day are 
independent. In reality they are correlated over short time intervals — the corre- 
lation length is probably about a week or so. This error makes our confidence 
intervals too tight. If we know, or can guess, the correlation length m, then we can 
scale the estimated standard deviation by the square root of m. For example, if we 
assume that the correlation length is 9 days, then the new confidence interval for 
x 1 is [1.38 °F, 5.88 °F] per century. 
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The original model can be improved in a few key ways. First of all, the assump- 
tion that the seasonal variation is sinusoidal is only an approximation — it falls apart 
for data collection sites in the tropics where, in principle, each year has two dates 
at which the Sun passes directly overhead and two dates in between when the Sun 
is furthest (to the north/south) from passing overhead. Also, the linear trend could 
be modeled as a function of global (or local) population density. Over 55 years, 
such a function is probably fairly, but certainly not exactly, linear. Hence, the basic 
models we have considered can be improved in various ways. Such improvements 
could inspire many interesting student projects. 

9.1. Getting the Data. Since the NOAA data is archived in one year batches, 
I wrote a UNIX shell script to grab the 55 annual data files for McGuire and then 
assemble the relevant pieces of data into a single file. Here is the shell script: 

http://www.princeton.edu/~rvdb/ampl/nlmodels/LocalWarming/McGuireAFB/data/getData.sh 

The resulting pair of data files that I used as input to my local climate model are 
posted at: 

http://www.princeton.edu/~rvdb/ampl/nlmodels/LocalWarming/McGuireAFB/data/McGuireAFB.dat 



and 



http://www.princeton.edu/~rvdb/ampl/nlmodels/LocalWarming/McGuireAFB/data/Dates.dat 
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